In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import random

Sampling Approaches

Direct Sampling



Roulette Wheel Sampling

As the store location PDF is one-dimensional, a simple roulette wheel approach is efficient for sampling. The roulette wheel builds a lookup table by computing the cumulative distribution function (CDF) and assigning each discrete value in the domain an interval. Values are sampled by generating a random number from a uniform distribution and scanning the table until the interval containing the randomly-generated number is found.


In [2]:
class RouletteWheelSampler(object):
    def __init__(self, domain, pdf):
        self._wheel = []
        end = 0.0
        for x in domain:
            end += pdf.probability(x)
            self._wheel.append((end, x))
        # sanity check: the cumulative probability should be ~1.0
        print(end)

    def sample(self):
        r = random.random()
        for end, x in self._wheel:
            if r < end:
                return x
        # we should never get here since probabilities
        # should sum to 1
        raise Exception("Could not pick a value!")
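
As a quick sanity check, the sampler can be exercised with a small hand-built discrete PDF. TablePDF below is only an illustrative stand-in for the store location PDF, not part of the original code:


In [ ]:
class TablePDF(object):
    # toy discrete PDF backed by a dict (illustrative only)
    def __init__(self, probs):
        self.probs = probs

    def probability(self, x):
        return self.probs[x]

table_pdf = TablePDF({"A": 0.2, "B": 0.5, "C": 0.3})
wheel = RouletteWheelSampler(["A", "B", "C"], table_pdf)
counts = {x: 0 for x in "ABC"}
for _ in range(10000):
    counts[wheel.sample()] += 1
print(counts)  # expect roughly 2000 / 5000 / 3000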

Monte Carlo Methods

Monte Carlo methods are stochastic methods for sampling from complex probability density functions (PDFs). Since complex PDFs are often very difficult to sample directly, Monte Carlo methods instead draw candidate samples from a simpler distribution and accept or reject each candidate, in such a way that the accepted samples follow the target distribution.


In [3]:
class MonteCarloSampler(object):
    def __init__(self, generator, pdf):
        self.generator = generator
        self.pdf = pdf
        self.num_iter_accept = []

    def sample(self):
        # draw candidates until one is accepted; a candidate is
        # accepted with probability equal to its target density
        # (valid here since the Gaussian density is <= 1 everywhere)
        num_iter = 0
        while True:
            num_iter += 1
            obj = self.generator.generate()
            prob = self.pdf.probability(obj)
            rand = random.random()
            if rand < prob:
                self.num_iter_accept.append(num_iter)
                return obj

    def sample_n(self, n):
        samples = []
        for i in range(n):
            samples.append(self.sample())
        return samples

    def acceptance_rate(self):
        if len(self.num_iter_accept) == 0:
            return 0.0
        return 1.0 / np.average(self.num_iter_accept)
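
To see why the acceptance rule works, note that a candidate $x$ drawn from a uniform proposal $q(x) = \frac{1}{b-a}$ is accepted with probability $p(x)$, so the density of accepted samples is proportional to $q(x)\,p(x) \propto p(x)$. This simple rule assumes $p(x) \le 1$ everywhere (true for our Gaussian) and a uniform proposal; a non-uniform proposal would require the general rejection-sampling acceptance probability $p(x) / (M\,q(x))$, where $M$ is a bound on $p(x)/q(x)$.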

In [4]:
class GaussianPDF(object):
    def __init__(self, mean, std_dev):
        self.mean = mean
        self.std_dev = std_dev

    def probability(self, x):
        exponent = -1.0 * np.power(x - self.mean, 2.0) / (2.0 * self.std_dev * self.std_dev)
        return np.exp(exponent) / (self.std_dev * np.sqrt(2.0 * np.pi))

class UniformSampler(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

    def generate(self):
        return random.uniform(self.a, self.b)

In [5]:
mean = 0.0
std = 1.0
narrow_a = -10.0
narrow_b = 10.0
wide_a = -100.0
wide_b = 100.0

In [6]:
gaussian_pdf = GaussianPDF(mean, std)
narrow_uniform_sampler = UniformSampler(narrow_a, narrow_b)
wide_uniform_sampler = UniformSampler(wide_a, wide_b)
narrow_mc = MonteCarloSampler(narrow_uniform_sampler, gaussian_pdf)
wide_mc = MonteCarloSampler(wide_uniform_sampler, gaussian_pdf)

In [7]:
n_samples = 10000
gauss_samples = [random.gauss(mean, std) for i in range(n_samples)]
narrow_uniform_samples = [random.uniform(narrow_a, narrow_b) for i in range(n_samples)]
wide_uniform_samples = [random.uniform(wide_a, wide_b) for i in range(n_samples)]

In [8]:
bins = np.linspace(wide_a, wide_b, num=1000)
narrow_hist, _ = np.histogram(narrow_uniform_samples, bins=bins, density=True)
wide_hist, _ = np.histogram(wide_uniform_samples, bins=bins, density=True)
gauss_hist, _ = np.histogram(gauss_samples, bins=bins, density=True)

In [9]:
plt.clf()
plt.plot(bins[:-1], gauss_hist, 'r-', label="Gaussian PDF")
plt.plot(bins[:-1], narrow_hist, 'g-', label="Narrow uniform")
plt.plot(bins[:-1], wide_hist, 'b-', label="Wide uniform")
plt.legend(loc="upper right")


Out[9]:
<matplotlib.legend.Legend at 0x10b0fdfd0>

The figure shows the differences between the three distributions. It should be apparent that the narrow-range uniform distribution overlaps more with the Gaussian distribution than the wide-range uniform distribution does. This will be important later.


In [10]:
n_samples = 10000
narrow_samples = narrow_mc.sample_n(n_samples)
wide_samples = wide_mc.sample_n(n_samples)
gauss_samples = [random.gauss(mean, std) for i in range(n_samples)]

In [11]:
bins = np.linspace(narrow_a, narrow_b, num=100)
narrow_hist, _ = np.histogram(narrow_samples, bins=bins, density=True)
wide_hist, _ = np.histogram(wide_samples, bins=bins, density=True)
gauss_hist, _ = np.histogram(gauss_samples, bins=bins, density=True)

In [12]:
plt.clf()
plt.plot(bins[:-1], gauss_hist, 'r-', label="Gaussian PDF")
plt.plot(bins[:-1], narrow_hist, 'g-', label="Narrow MC")
plt.plot(bins[:-1], wide_hist, 'b-', label="Wide MC")
plt.legend(loc="upper right")


Out[12]:
<matplotlib.legend.Legend at 0x10b253f10>

I ran the Monte Carlo sampler using the wide and narrow uniform distributions to generate candidate samples. Even though neither of these distributions matches the target distribution (the Gaussian), the Monte Carlo process ensures that we sample the target distribution. This is the power of a Monte Carlo method: candidates can be generated from a different, simpler distribution, as long as it covers the target's domain (in principle $(-\infty, \infty)$ here; in practice, I'm assuming that samples outside of the range $[-10, 10]$ have near-zero probability, hence my choice for the ranges of the uniform distributions). Regardless of the range used for generating candidates, the MC method ensures we sample from the correct distribution.

When the target distribution is very complex, sampling it directly can be impossible. MC methods allow us to overcome this limitation.
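
As an illustration, the same MonteCarloSampler can target a two-component Gaussian mixture. A mixture is admittedly easy to sample by other means; here it is just a convenient stand-in for a PDF we can only evaluate pointwise, and BimodalPDF is a hypothetical class, not part of the original code:


In [ ]:
class BimodalPDF(object):
    # illustrative mixture of two Gaussians, evaluated pointwise
    def __init__(self, pdf_a, pdf_b, weight=0.5):
        self.pdf_a = pdf_a
        self.pdf_b = pdf_b
        self.weight = weight

    def probability(self, x):
        return (self.weight * self.pdf_a.probability(x) +
                (1.0 - self.weight) * self.pdf_b.probability(x))

bimodal = BimodalPDF(GaussianPDF(-3.0, 1.0), GaussianPDF(3.0, 1.0))
bimodal_mc = MonteCarloSampler(UniformSampler(-10.0, 10.0), bimodal)
bimodal_samples = bimodal_mc.sample_n(1000)  # follows the mixture density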


In [13]:
print(narrow_mc.acceptance_rate(), wide_mc.acceptance_rate())


0.0499385755521 0.00500583179404

This is not to say that the choice of sampling distribution doesn't have any impact -- in fact, it is quite important. The choice of sampling distribution affects the efficiency of the Monte Carlo method, in particular its acceptance rate. The narrow uniform sampler was more likely to pick values in the high-probability region of the Gaussian distribution, while the wide uniform sampler was more likely to pick values outside of it. As a result, the MC method accepted about 1 out of every 20 candidates generated by the narrow sampler but only about 1 out of every 200 generated by the wide sampler.
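
These rates line up with what the math predicts: with a uniform proposal on $[a, b]$, the expected acceptance rate is $\frac{1}{b-a} \int_a^b p(x)\,dx \approx \frac{1}{b-a}$ when nearly all of the Gaussian's mass lies inside $[a, b]$:


In [ ]:
print(1.0 / (narrow_b - narrow_a))  # 0.05, matching ~0.0499 above
print(1.0 / (wide_b - wide_a))      # 0.005, matching ~0.0050 above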